Device, System and Method for Skin Detection

ABSTRACT

The present invention relates to a device, system and method for skin detection. To enable reliable, accurate and fast skin detection, the proposed device comprises an input unit (20) for obtaining a sequence of image frames acquired over time, a segmentation unit (22) for segmenting an image frame of said sequence of image frames, a tracking unit (24) for tracking segments of the segmented image frame over time in said sequence of image frames, and a clustering unit (26) for clustering the tracked segments to obtain clusters representing skin of a subject by use of one or more image features of said tracked segments.

CROSS-REFERENCE TO PRIOR APPLICATION

This application claims the benefit of European Patent Application No. 14197951.8 filed Dec. 15, 2014. The application is hereby incorporated by reference herein.

FIELD OF THE INVENTION

The present invention relates to a device, system and method for skin detection.

BACKGROUND OF THE INVENTION

Vital signs of a person, for example the heart rate (HR), the respiration rate (RR) or the arterial blood oxygen saturation (SpO2), serve as indicators of the current health state of a person and as powerful predictors of serious medical events. For this reason, vital signs are extensively monitored in inpatient and outpatient care settings, at home or in further health, leisure and fitness settings.

One way of measuring vital signs is plethysmography. Plethysmography generally refers to the measurement of volume changes of an organ or a body part and in particular to the detection of volume changes due to a cardio-vascular pulse wave traveling through the body of a subject with every heartbeat.

Photoplethysmography (PPG) is an optical measurement technique that evaluates a time-variant change of light reflectance or transmission of an area or volume of interest. PPG is based on the principle that blood absorbs light more than surrounding tissue, so variations in blood volume with every heartbeat affect transmission or reflectance correspondingly. Besides information about the heart rate, a PPG waveform can comprise information attributable to further physiological phenomena such as the respiration. By evaluating the transmittance and/or reflectivity at different wavelengths (typically red and infrared), the blood oxygen saturation can be determined.

Recently, non-contact, remote PPG (rPPG) devices (also called camera rPPG devices) for unobtrusive measurements have been introduced. Remote PPG utilizes light sources or, in general, radiation sources, disposed remotely from the subject of interest. Similarly, a detector, e.g., a camera or a photo detector, can also be disposed remotely from the subject of interest. Therefore, remote photoplethysmographic systems and devices are considered unobtrusive and well suited for medical as well as non-medical everyday applications. This technology particularly has distinct advantages for patients with extreme skin sensitivity requiring vital signs monitoring such as Neonatal Intensive Care Unit (NICU) patients with extremely fragile skin or premature babies.

Verkruysse et al., “Remote plethysmographic imaging using ambient light”, Optics Express, 16(26), 22 Dec. 2008, pp. 21434-21445 demonstrates that photoplethysmographic signals can be measured remotely using ambient light and a conventional consumer level video camera, using red, green and blue color channels.

Wieringa, et al., “Contactless Multiple Wavelength Photoplethysmographic Imaging: A First Step Toward “SpO2 Camera” Technology”, Ann. Biomed. Eng. 33, 1034-1041 (2005), discloses a remote PPG system for contactless imaging of arterial oxygen saturation in tissue based upon the measurement of plethysmographic signals at different wavelengths. The system comprises a monochrome CMOS-camera and a light source with LEDs of three different wavelengths. The camera sequentially acquires three movies of the subject at the three different wavelengths. The pulse rate can be determined from a movie at a single wavelength, whereas at least two movies at different wavelengths are required for determining the oxygen saturation. The measurements are performed in a darkroom, using only one wavelength at a time.

Apart from the advantage of being fully contactless, cameras (generally called imaging devices) provide 2D information, which allows for a multi-spot and large area measurement, and often contains additional context information. Unlike with contact sensors, which rely on the correct placement on a specific measurement point/area, the regions used to measure pulse signals using rPPG technology are determined from the actual image. Therefore, accurate detection of skin areas that is reliable under any illumination conditions becomes a crucial part of the processing chain of a camera-based rPPG device and method used for camera-based vital signs monitoring.

Currently, there are two main approaches known for reliable detection and tracking of skin areas.

One approach is based on skin color (RGB-based) detection and segmentation. Methods according to this approach are fast in both detection and tracking of areas with skin color. However, they are not robust to changes of ambient light color, which will change the color of light reflected from a skin area, and are not able to detect skin areas under low illumination conditions or in darkness. Moreover, such methods cannot always differentiate skin from other objects with the same color.

Another approach is based on extracted PPG signals (PPG-based). Methods according to this approach are more robust in differentiating real skin areas from areas of other objects of the same skin color. This approach can also be used to segment the skin areas which have a stronger PPG signal (the most periodic signal). However, the reliability of the approach depends on the robustness of PPG signal extraction, thus it is impacted by motion of a subject and the blood perfusion level. Therefore, if a pulse signal is not periodic or is weak, a camera-based system will have difficulties to detect and segment the skin areas. Moreover, the approach is also computationally expensive.

It should be noted that the detection of skin areas is not only of interest in the field of vital signs detection based on the rPPG technology, but also in other technical fields, e.g. in remote gaming applications using camera technology to recognize gestures of the player, face detection, security (robust detection of a person using surveillance cameras and detection of a person wearing a mask or distinguishing real faces from a realistic mask in a camera registration), etc.

A problem with surveillance camera images is that it is difficult to distinguish between a face and a mask or a picture of a face. A problem with camera-based vital signs monitoring is that the vital signs can only be extracted from living tissue, usually some skin area, but currently this area is most often manually selected from the video and subsequently tracked, or one has to rely on a face detector that typically fails on side views of a face, and certainly does not give anything useful for non-facial skin in the video. Hence, there is still a need for ways to reliably, accurately and quickly find every useful skin surface.

SUMMARY OF THE INVENTION

It is an object of the present invention to provide a device and a corresponding method as well as a system which allow a reliable, accurate and fast detection of skin.

In a first aspect of the present invention a device for skin detection is presented comprising

- an input unit for obtaining a sequence of image frames acquired over time,
- a segmentation unit for segmenting an image frame of said sequence of image frames,
- a tracking unit for tracking segments of the segmented image frame over time in said sequence of image frames,
- a clustering unit for clustering the tracked segments to obtain clusters representing skin of a subject by use of one or more image features of said tracked segments.

In a further aspect of the present invention a corresponding method for skin detection is presented.

In a still further aspect of the present invention a system for skin detection is presented comprising

- an imaging unit for acquiring a sequence of image frames over time and
- a device as disclosed herein for skin detection based on the acquired sequence of image frames.

In yet further aspects of the present invention, there are provided a computer program which comprises program code means for causing a computer to perform the steps of the method disclosed herein when said computer program is carried out on a computer as well as a non-transitory computer-readable recording medium that stores therein a computer program product, which, when executed by a processor, causes the method disclosed herein to be performed.

Preferred embodiments of the invention are defined in the dependent claims. It shall be understood that the claimed methods, processor, computer program and medium have similar and/or identical preferred embodiments as the claimed system and as defined in the dependent claims.

The present invention solves the problems and disadvantages of the known methods and devices using a three step approach. First, in the analysis window (time-interval) an image (mostly the first of the window) is segmented. Next, individual segments are tracked over time (resulting in tracked segments over time which each may also be considered as a volume). Said tracking is performed by use of one or more features, including features describing the temporal domain, which e.g. have been extracted from the tracked segments (volumes). Finally, a clustering is performed of said tracked segments based on said features identifying one or more clusters that contain living tissue, i.e. skin area, and possibly one or more clusters that do not contain any living tissue. This allows a reliable, accurate and fast detection of skin in a sequence of image frames as typically acquired by a video camera.

In an embodiment said segmentation unit is configured to over-segment the image frame that is used for segmentation. Generally, an assumption is made with respect to the (minimum) size of the living tissue (skin area) that shall be detected in the image frames. The proposed “over-segmentation” aims at finding initial segments that are smaller than this minimum size. Often, in segmentation, there is some parameter to be chosen that determines the number of segments that are found. For example, in a feature point detector it is possible to set a minimum quality of a feature point, which affects the number of features that can be found. Preferably, said segmentation unit is configured to over-segment the image frame that is used for segmentation based on color, position and/or texture properties.

The aim of the over-segmentation is to have at least a number of segments that only contain living tissue and no background as could occur with larger segments resulting from the triangulation. At the same time, the segments should be large enough to allow for extraction of a pulse-signal with sufficient SNR to be used as a feature in clustering the segments belonging to tissue of the same living being.

In another embodiment said clustering unit is configured to use temporal color variations of the tracked segments as feature for clustering the tracked segments. For instance, said clustering unit may be configured to perform a Fourier transformation of the spatially combined color value of at least part of the pixels in a tracked segment and to cluster two tracked segments into a single cluster if their amplitude peak frequency has substantially the same value and phase. Hence, to know if two tracked segments belong to the same cluster (i.e. to the same living being) the frequency and the phase of their temporal color variations may be compared. This provides for a rather simple but efficient way of clustering.
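The comparison of peak frequency and phase described above can be illustrated with a minimal sketch. It is not the claimed implementation: the function names `dominant_component` and `same_cluster`, the frame rate argument `fps` and both tolerance values are assumptions chosen only for illustration, and each input is assumed to be the spatially averaged color value of one tracked segment over the analysis window.

```python
import cmath
import math


def dominant_component(signal, fps):
    """Return (frequency in Hz, phase in rad) of the strongest non-DC DFT bin."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC level first
    best_k, best_coeff = 1, 0j
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        if abs(coeff) > abs(best_coeff):
            best_k, best_coeff = k, coeff
    return best_k * fps / n, cmath.phase(best_coeff)


def same_cluster(sig_a, sig_b, fps, freq_tol_hz=0.1, phase_tol_rad=0.5):
    """Hypothetical criterion: peak frequency and phase both agree within tolerance."""
    freq_a, phase_a = dominant_component(sig_a, fps)
    freq_b, phase_b = dominant_component(sig_b, fps)
    # Wrap the phase difference into [-pi, pi] before comparing.
    dphi = math.atan2(math.sin(phase_a - phase_b), math.cos(phase_a - phase_b))
    return abs(freq_a - freq_b) <= freq_tol_hz and abs(dphi) <= phase_tol_rad
```

Under these assumptions, two segments on the same subject (same pulse rate, nearly the same phase) would be merged, whereas a segment with a different pulse rate or an opposite phase would be kept apart.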

In another embodiment said clustering unit is configured to compute the inner product between normalized time signals of different tracked segments and to cluster said tracked segments into a single cluster if said inner product exceeds a predetermined threshold. Said time signals are preferably color signals in a time window, which may be normalized by their mean in that window. The higher the value of the inner product is, the more similar the tracked segments are. This provides for another rather simple but effective way of clustering.
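A minimal sketch of such an inner-product test is given below. The division by the window mean follows the text; subtracting the resulting constant level and scaling to unit length, as well as the threshold of 0.8 and the function names, are assumptions made here so that the inner product lies between -1 and 1.

```python
import math


def normalize(trace):
    """Divide by the window mean, remove the constant level, scale to unit length."""
    mean = sum(trace) / len(trace)
    ac = [v / mean - 1.0 for v in trace]  # mean-normalized, zero-mean signal
    norm = math.sqrt(sum(v * v for v in ac))
    return [v / norm for v in ac] if norm > 0 else ac


def similarity(trace_a, trace_b):
    """Inner product of the two normalized time signals."""
    a, b = normalize(trace_a), normalize(trace_b)
    return sum(x * y for x, y in zip(a, b))


def belong_together(trace_a, trace_b, threshold=0.8):
    """Hypothetical rule: same cluster if the inner product exceeds the threshold."""
    return similarity(trace_a, trace_b) > threshold
```

With this normalization, two traces that differ only in brightness and pulsatile amplitude yield an inner product close to 1, while traces in quadrature yield a value close to 0.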

In still another embodiment said clustering unit is configured to use the hue of pixels of the tracked segments as feature for clustering the tracked segments, which provides for still another rather simple but effective way of clustering. It would also be possible to require that “members” of a cluster are connected to other members. However, different skin parts may not necessarily have the same hue and also can be unconnected due to clothing etc. so that such features are preferably used in addition to other features for clustering.

Preferably, said segmentation unit, said tracking unit and said clustering unit are configured to repeatedly perform segmentation, tracking and clustering, wherein the sequences of acquired image frames used for subsequent repetitions overlap in time. Hence, successive (time) windows of image frames are preferably processed sequentially, which successive windows are partially overlapping.
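The partially overlapping windows can be pictured with the following short sketch. The window length and hop size are not specified in the text; both parameters, and the function name `overlapping_windows`, are hypothetical.

```python
def overlapping_windows(num_frames, window_len, hop):
    """Yield (start, end) frame index pairs, end exclusive.

    Successive windows overlap by window_len - hop frames.
    """
    if not 0 < hop <= window_len:
        raise ValueError("hop must be between 1 and window_len")
    start = 0
    while start + window_len <= num_frames:
        yield (start, start + window_len)
        start += hop
```

Segmentation, tracking and clustering would then be repeated once per yielded window.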

The segmentation unit may be configured to perform a feature point detection, in particular a Harris corner detection, or a triangulation, in particular a Delaunay triangulation. A Harris corner detector is e.g. described in C. Harris and M. Stephens, “A combined corner and edge detector”, Proceedings of the 4th Alvey Vision Conference, 1988, pp. 147-151. It considers the differential of a corner score with respect to direction directly, instead of using shifted patches. A Delaunay triangulation is e.g. described in Delaunay, Boris: “Sur la sphère vide. A la mémoire de Georges Voronoï”, Bulletin de l'Académie des Sciences de l'URSS, Classe des sciences mathématiques et naturelles, No. 6: 793-800, 1934. A Delaunay triangulation for a set P of points in a plane is a triangulation DT(P) such that no point in P is inside the circumcircle of any triangle in DT(P). Delaunay triangulations maximize the minimum angle of all the angles of the triangles in the triangulation; they tend to avoid skinny triangles. These methods are advantageous (but not the only) options for performing the segmentation.
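The empty-circumcircle property that defines a Delaunay triangulation can be checked with the standard in-circle determinant. The sketch below only verifies a given triangulation; it does not construct one, and the function names are chosen here for illustration.

```python
def orientation(a, b, c):
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def in_circumcircle(a, b, c, d):
    """True if point d lies strictly inside the circumcircle of triangle abc."""
    adx, ady = a[0] - d[0], a[1] - d[1]
    bdx, bdy = b[0] - d[0], b[1] - d[1]
    cdx, cdy = c[0] - d[0], c[1] - d[1]
    ad = adx * adx + ady * ady
    bd = bdx * bdx + bdy * bdy
    cd = cdx * cdx + cdy * cdy
    det = (adx * (bdy * cd - bd * cdy)
           - ady * (bdx * cd - bd * cdx)
           + ad * (bdx * cdy - bdy * cdx))
    # The sign of det depends on the vertex order, so correct for orientation.
    return det * orientation(a, b, c) > 0


def is_delaunay(points, triangles):
    """Check that no point lies inside the circumcircle of any triangle."""
    for i, j, k in triangles:
        for idx, p in enumerate(points):
            if idx in (i, j, k):
                continue
            if in_circumcircle(points[i], points[j], points[k], p):
                return False
    return True
```

For a kite-shaped set of four points, for example, splitting along the short diagonal satisfies the property, whereas splitting along the long diagonal produces two skinny triangles and violates it.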

In still another embodiment said tracking unit is configured to track the segments of the segmented image frame using motion estimation. Other options may be the use of any tracking based on appearance, e.g. using face detection or object detection which allows tracking of an image part.

Still further, the device may further comprise a vital signs detector for detecting vital signs of a subject based on image information from detected skin areas within said sequence of image frames. Preferably, the rPPG technology is applied for obtaining the vital signs.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other aspects of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter. In the following drawings

FIG. 1 shows a schematic diagram of an embodiment of a system according to the present invention,

FIG. 2 shows a schematic diagram of an embodiment of a device according to the present invention,

FIG. 3A shows an image frame illustrating triangulation as used in an embodiment of the present invention,

FIG. 3B shows an image frame illustrating triangulation as used in an embodiment of the present invention,

FIG. 4A shows a diagram of normalized color channels related to region A of FIG. 3B,

FIG. 4B shows a diagram of pulse signals corresponding to region A of FIG. 3B,

FIG. 4C shows a diagram of normalized spectra of pulse signals and of auto-correlation of pulse signals corresponding to region A of FIG. 3B,

FIG. 5A shows a diagram of normalized color channels related to region B of FIG. 3B,

FIG. 5B shows a diagram of pulse signals corresponding to region B of FIG. 3B,

FIG. 5C shows a diagram of normalized spectra of pulse signals and of auto-correlation of pulse signals corresponding to region B of FIG. 3B,

FIG. 6A shows a diagram of normalized color channels related to region C of FIG. 3B,

FIG. 6B shows a diagram of pulse signals corresponding to region C of FIG. 3B,

FIG. 6C shows a diagram of normalized spectra of pulse signals and of auto-correlation of pulse signals corresponding to region C of FIG. 3B,

FIG. 7A shows a diagram of normalized color channels related to region D of FIG. 3B,

FIG. 7B shows a diagram of pulse signals corresponding to region D of FIG. 3B,

FIG. 7C shows a diagram of normalized spectra of pulse signals and of auto-correlation of pulse signals corresponding to region D of FIG. 3B,

FIG. 8A shows scatter plots of the feature space from all regions,

FIG. 8B shows scatter plots of the feature space from regions passing a prefiltering stage in clustering,

FIG. 9A shows an example of a distribution of points in a 2-dimensional space within the radius of a circle,

FIG. 9B shows an example of the partitioning of a distribution of points in a 2-dimensional space using a given range and minPts,

FIG. 10A shows scatter plots of feature space of regions that pass a prefiltering step,

FIG. 10B shows the result of automatic skin detection for an example video sequence,

FIG. 11 shows an illustration of the overlap-add procedure, and

FIG. 12 shows a flow chart of an embodiment of a method according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 shows a schematic diagram of a first embodiment of a system 1 for skin detection according to the present invention. The system 1 comprises an imaging unit 10 for acquiring a sequence of image frames of a scene over time. The scene includes, in this example, a patient 2 lying in a bed 3, e.g. in a hospital room or other healthcare facility, but may also be the environment of a neonate or premature infant, e.g. lying in an incubator/warmer, or a person at home or in a different environment. The imaging unit 10 is particularly a camera (also referred to as detection unit or as camera-based or remote PPG sensor), which is configured to obtain images of the scene, preferably including skin areas 4 of the patient 2. In an application of the device for obtaining vital signs of the patient 2, the skin area 4 is preferably an area of the face, such as the cheeks or the forehead, but may also be another area of the body, such as the hands or the arms.

The image frames captured by the camera may particularly correspond to a video sequence captured by means of an analog or digital photosensor, e.g. in a (digital) camera. Such a camera usually includes a photosensor, such as a CMOS or CCD sensor, which may also operate in a specific spectral range (visible, IR) or provide information for different spectral ranges. The camera may provide an analog or digital signal. The image frames include a plurality of image pixels having associated pixel values. Particularly, the image frames include pixels representing light intensity values captured with different photosensitive elements of a photosensor. These photosensitive elements may be sensitive in a specific spectral range (i.e. representing a specific color). The image frames include at least some image pixels being representative of a skin portion of the person. Thereby, an image pixel may correspond to one photosensitive element of a photo-detector and its (analog or digital) output or may be determined based on a combination (e.g. through binning) of a plurality of the photosensitive elements.

The system 1 further comprises a device 12 according to the present invention for skin detection based on the acquired sequence of image frames. An embodiment of the device 12 will be explained in more detail below with reference to FIG. 2.

The system 1 may further comprise a controller 14 for controlling the other elements of the system, and a user interface 16, such as a keyboard and/or a display, for entering commands for controlling the system 1 and/or outputting generated information, such as obtained vital signs. These elements 14 and 16 may be part of the device 12, as shown in FIG. 1, or may be separate elements.

A schematic diagram of an embodiment of a device 12 for skin detection according to the present invention is shown in FIG. 2. The device 12 comprises an input unit 20 for obtaining a sequence of image frames acquired over time, either directly from the imaging unit 10 or from a storage (not shown). The input unit 20 may be a data interface for wired or wireless reception of data, e.g. via a network or via a direct link.

The device 12 further comprises a segmentation unit 22 for segmenting an image frame of said sequence of image frames, a tracking unit 24 for tracking segments of the segmented image frame over time in said sequence of image frames and a clustering unit 26 for clustering the tracked segments to obtain clusters representing skin of a subject by use of one or more image features of said tracked segments. The function of these units 22, 24, 26 will be explained in more detail below.

The device 12 optionally further comprises a vital signs detector 28 for detecting vital signs of the subject 2 within the scene based on image information from detected skin areas within a sequence of images acquired by the imaging unit 10. The vital signs detector 28 preferably applies the remote PPG technique to obtain vital signs from said image information, e.g. heartbeat, SpO2, etc.

The different units of the device 12 may be configured as dedicated hardware elements, but may also be configured as processor or computer, which is programmed accordingly. The device 12 may be configured as integrated device including all these units, e.g. in a common housing (e.g. in the housing of the imaging unit 10) or as distributed device, i.e. implemented as separate elements and units. Preferably, the device 12 is configured as computer programmed with appropriate software for carrying out the functions of the units of the device 12.

Preferably, the segmentation unit 22 over-segments an image, i.e. segments the image frame into smaller segments than conventionally required or used for segmentation. Although the (over)-segmentation can use alternative features, in an embodiment a triangulation of the image based on Harris feature points is used, as this elegantly solves the segment-tracking over time with low computational effort. Corner detection, as used in a Harris detector, works on the principle that if a small window is placed on a corner in an image and moved in any direction, there will be a large change in intensity. If the window is over a flat area of the image then there will be no intensity change when the window moves. If the window is over an edge there will only be an intensity change if the window moves in one direction. If the window is over a corner then there will be a change in all directions, and therefore it will be known that there must be a corner. A Harris corner detector measures the strength of detected corners, and only marks those above a given strength as actual corners.

Alternatively, a dense motion estimation could be performed, assigning a motion vector to every pixel or region in the image, and these motion vectors could be used to track any (e.g. color) segmentation.

Further, it is even possible to consider every pixel a segment, provided the pixels have a low enough (quantization) noise-level. To achieve such a good noise level, pixels may be grouped into blocks (fixed grid) as initial segmentation.
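Grouping pixels into a fixed grid of blocks can be sketched as follows, under the assumptions that one color channel is given as a list of pixel rows, that each block is summarized by its mean, and that incomplete blocks at the right and bottom borders are discarded. The block size and the function name `block_average` are hypothetical.

```python
def block_average(image, block):
    """Average non-overlapping block x block tiles of a 2D list of pixel values.

    Incomplete tiles at the right and bottom borders are dropped.
    """
    rows = len(image) // block
    cols = len(image[0]) // block
    out = []
    for br in range(rows):
        row = []
        for bc in range(cols):
            total = sum(image[br * block + i][bc * block + j]
                        for i in range(block) for j in range(block))
            row.append(total / (block * block))
        out.append(row)
    return out
```

Averaging over a block of b x b pixels reduces uncorrelated pixel noise by roughly a factor b in standard deviation, at the cost of spatial resolution.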

As a further alternative the image could be down-scaled to a lower spatial resolution such that the resulting low-resolution pixels have a good SNR due to the spatial averaging of the down-scaling filter.

Whatever the segmentation and tracking method used, different tracked segments are subsequently clustered based on minute temporal variations of the color that are characteristic for living tissue and caused by cardiac activity. Using the frequency and phase of these color variations, tissue from multiple subjects can be distinguished and disjoint tissue parts belonging to the same individual end up in the same cluster.

In the following a detailed embodiment of the proposed method shall be explained comprising three main stages: (1) region tracking, (2) feature extraction and (3) region clustering. First, the first image in a predefined interval is divided into adjacent regions which are tracked during the remaining images in the interval. Second, discriminative information related to skin is extracted from these regions as a feature vector. Third, a clustering of regions is performed using the feature vector to determine the ROI.

As a tradeoff between estimation accuracy and computational complexity, the images in a predefined interval are segmented into adjacent regions. To obtain accurate pulse signals, the relocation of all points in a region would have to be determined in order to compensate for movement artifacts; however, this can prove to be a very time-consuming task. As an alternative solution, the trajectory of solely the boundary points can be tracked, assuming that the relocation of the interior points is similar. For this approach, points suitable for tracking are located in the initial image of the interval and their trajectories are estimated for the entire interval. Regions shaped as triangles are established using triangulation derived from these points.

Suitable points for tracking are only located in areas of the image containing the complete motion information. This restriction is known as the aperture problem and can be resolved by selecting points where two edges intersect (corner points). In this embodiment, the corner detection method proposed by Harris & Stephens as e.g. described in the above cited document is implemented for the detection of trackable interest points. For each point p=[x y]^(T) in the image, a corner response measure R is computed which is defined as:

R=Det(M)−k[Tr(M)]²  (1)

where k is a tunable sensitivity parameter and M is a 2×2 symmetric matrix given by:

$\begin{matrix}{M = \sum\limits_{(u,v) \in N(x,y)} w(u,v) \begin{bmatrix} I_{x}^{2} & I_{x}I_{y} \\ I_{x}I_{y} & I_{y}^{2} \end{bmatrix}} & (2)\end{matrix}$

where I_(x) and I_(y) denote the image derivatives in the horizontal and vertical direction within a neighborhood N(x, y). The weighting function w(u, v) is shaped as a circular Gaussian to smooth the corner response. By applying a threshold, the points with a corner response R greater than a certain value can be considered as corner points and are therefore suitable for tracking. In FIG. 3A, interest points suitable for tracking according to the corner detection are shown for a video sequence containing a hand.
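As a worked illustration of the corner response R = Det(M) − k[Tr(M)]², the following sketch evaluates R from the image derivatives sampled in one neighborhood. The uniform default weights (instead of a Gaussian window) and the value k = 0.04 are assumptions made here for brevity; the function name is hypothetical.

```python
def harris_response(grads, k=0.04, weights=None):
    """Corner response R = det(M) - k * trace(M)^2 for one neighborhood.

    grads: list of (Ix, Iy) derivative pairs sampled in the neighborhood.
    weights: optional per-sample weights w(u, v); uniform if omitted.
    """
    if weights is None:
        weights = [1.0] * len(grads)
    a = sum(w * ix * ix for w, (ix, iy) in zip(weights, grads))
    b = sum(w * ix * iy for w, (ix, iy) in zip(weights, grads))
    c = sum(w * iy * iy for w, (ix, iy) in zip(weights, grads))
    det = a * c - b * b
    trace = a + c
    return det - k * trace * trace
```

A flat neighborhood gives R = 0, an edge (derivatives in one direction only) gives a negative R, and a corner (strong derivatives in both directions) gives a positive R that can be compared against the threshold.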

After detecting the interest points in the initial frame of the interval, these points are tracked during the remaining images in the interval. To find the trajectory of a point p=[x, y]^(T), the displacement vector Δ=[d_(x), d_(y)] between consecutive images is obtained which minimizes the following equation:

$\begin{matrix}{{e\left( {d_{x},d_{y}} \right)} = {\sum\limits_{u,{v \in {N{({x,y})}}}}\; \left\lbrack {{I\left( {u,v} \right)} - {J\left( {{u + d_{x}},{v + d_{y}}} \right)}} \right\rbrack^{2}}} & (3)\end{matrix}$

Under the assumption that the displacement of the image contents between the two images I and J is not too large and is approximately constant within a neighborhood N(x, y), this optimization problem can be efficiently solved using e.g. a method as described in B. D. Lucas, “An Iterative Image Registration Technique with an Application to Stereo Vision”, vol. 130, pp. 121-130, 1981. To accommodate large displacements, an image pyramid implementation of the Lucas Kanade tracker described in J.-Y. Bouguet, “Pyramidal implementation of the Affine Lucas Kanade feature tracker, Description of the algorithm”, Intel Corporation, Microprocessor Research Labs may be used in this embodiment. In this approach, the displacement vector found on an upper pyramid level is propagated onto the lower levels and refined using the high frequency details.

There are two cases where an interest point might be considered as “lost”. The first case is that the estimated position of an interest point falls outside the boundaries of the image, which can be readily verified. The second case is when an interest point disappears due to occlusion; it may be declared “lost” if the optimal $\varepsilon(\tilde{d}_{x},\tilde{d}_{y})$ exceeds a certain threshold. However, since the tracking is based on consecutive pairs of images, the displacement errors are inherently small and choosing a threshold is a challenging task. As occlusion areas cause unreliable displacement estimates, a symmetry check is included where the actual position p_(I)=[x y]^(T) is compared with the estimated position $\tilde{p}_{I}=[\tilde{x}\ \tilde{y}]^{T}$ resulting from the ‘backward’ displacement estimate using $p_{J}=[x+\tilde{d}_{x},\ y+\tilde{d}_{y}]^{T}$. In case the difference in these locations is larger than a certain value, currently set to 2 pixels, the interest point is considered “lost” and is not tracked further.
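The symmetry (forward-backward) check can be sketched as below. The two displacement estimators are passed in as callables and are placeholders for whatever tracker is used (for example a Lucas Kanade step); their interface, like the function name `symmetry_check`, is an assumption made for this illustration. The limit of 2 pixels is the value stated in the text.

```python
import math


def symmetry_check(point, forward, backward, max_error_px=2.0):
    """Forward-backward consistency test for one interest point.

    forward(p):  estimated displacement of p from image I to image J.
    backward(p): estimated displacement of p from image J back to image I.
    Returns (is_tracked, position_in_J).
    """
    x, y = point
    dx, dy = forward(point)
    p_j = (x + dx, y + dy)
    bx, by = backward(p_j)
    p_back = (p_j[0] + bx, p_j[1] + by)
    # Distance between the actual and the back-projected position in image I.
    error = math.hypot(p_back[0] - x, p_back[1] - y)
    return error <= max_error_px, p_j
```

A point whose backward estimate returns to within 2 pixels of its starting position is kept; otherwise it is treated as lost.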

As discussed above, some interest points can disappear when moving outside the image or due to occlusion. Deriving triangles using the detected interest points and tracking these triangles is a valid option; however, in doing so there is a risk of having triangles discarded if one interest point disappears. The triangulation is therefore performed using interest points that are successfully tracked during the complete interval. In order to avoid triangles with a large ratio between the longest and shortest edge, i.e. skinny triangles, Delaunay triangulation is applied. This method attempts to maximize the minimum angle of all the angles of the triangles in the triangulation. FIG. 3B shows the resulting regions for the example sequence.

Following the region tracking, R, G and B traces are composed for each region by concatenating the average pixel values of every image in the interval. The pulse signal for all regions in the interval is computed using the method X_(s)minαY_(s) as described in G. de Haan and V. Jeanne, “Robust pulse-rate from chrominance-based rPPG”, IEEE transactions on bio-medical engineering, no. c, pp. 1-9, Jun. 2013. This chrominance-based method uses a combination of two orthogonal chrominance signals X=R−G and Y=0.5R+0.5G−B and is therefore capable of eliminating specular reflection changes due to movement, assuming a white light source. In order to enable correct functioning with colored light sources, skin-tone standardization is applied resulting in the following equation:

X _(s)=3R _(n)−2G _(n)

Y _(s)=1.5R _(n) +G _(n)−1.5B _(n)  (4)

where R_(n), G_(n) and B_(n) are the color channels normalized by dividing the samples by their mean over the interval, to provide a pulse signal that is independent of the brightness of the light source. The weighted difference between the band-pass filtered versions of X_(s) and Y_(s) is considered as the pulse signal:

$\begin{matrix}{S = X_{f} - \alpha Y_{f}} & (5)\end{matrix}$

with

$\begin{matrix}{\alpha = \frac{\sigma(X_{f})}{\sigma(Y_{f})}} & (6)\end{matrix}$

where σ(X_(f)) and σ(Y_(f)) are the standard deviations of X_(f) and Y_(f), which are the band-pass filtered versions of X_(s) and Y_(s). The normalized color traces R_(n), G_(n) and B_(n) and resulting pulse signals are given in FIGS. 4 to 7 for regions indicated in FIG. 3B. After obtaining the set of temporal color traces R_(n), G_(n) and B_(n) shown in FIGS. 4A, 5A, 6A, 7A, and pulse signal S shown in FIGS. 4B, 5B, 6B, 7B for each region in the interval, discriminant features for clustering are extracted. These features should summarize unique characteristics of skin regions. FIGS. 4C, 5C, 6C, 7C show normalized spectra V of the pulse signals and normalized spectra W of the auto-correlation of the pulse signals.
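The chrominance combination can be sketched as follows, using the coefficients of the cited de Haan and Jeanne method (X_s = 3R_n − 2G_n, Y_s = 1.5R_n + G_n − 1.5B_n). As a stated simplification, the band-pass filter is replaced here by plain mean removal, which is an assumption and not a real band-pass; the function names are hypothetical.

```python
import math


def normalize(trace):
    """Divide every sample by the mean of the trace over the interval."""
    mean = sum(trace) / len(trace)
    return [v / mean for v in trace]


def remove_mean(sig):
    """Stand-in for the band-pass filter (assumption): only removes the mean."""
    m = sum(sig) / len(sig)
    return [v - m for v in sig]


def std(sig):
    """Population standard deviation."""
    m = sum(sig) / len(sig)
    return math.sqrt(sum((v - m) ** 2 for v in sig) / len(sig))


def chrom_pulse(r, g, b):
    """Pulse signal S = X_f - alpha * Y_f from average R, G, B traces of one region."""
    rn, gn, bn = normalize(r), normalize(g), normalize(b)
    xs = [3.0 * rv - 2.0 * gv for rv, gv in zip(rn, gn)]
    ys = [1.5 * rv + gv - 1.5 * bv for rv, gv, bv in zip(rn, gn, bn)]
    xf, yf = remove_mean(xs), remove_mean(ys)
    sigma_y = std(yf)
    alpha = std(xf) / sigma_y if sigma_y > 0 else 0.0
    return [x - alpha * y for x, y in zip(xf, yf)]
```

With these coefficients, an intensity change that scales all three channels equally appears identically in X_f and Y_f, so alpha becomes 1 and the change cancels in S, while a pulsatile variation with channel-dependent amplitude is retained.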

Similar to the existing pixel-based skin detection methods described above, the fact that the color properties of skin regions within a given image lie in a close range is preferably used. A disadvantage of using the RGB color space representation is its sensitivity to illumination intensity changes. To enhance robustness towards changes in illumination and shadows, the color space can be transformed into a color space where the intensity is separated from the intrinsic information of color. Therefore, the Hue-Saturation-Value (HSV) color space is used in an embodiment, where the value component, which is related to the intensity, and the saturation are disregarded. The hue is computed using the following equation:

$\begin{matrix}{H = \left\{ \begin{matrix}{{60 \times \left( {\frac{G - B}{C}{mod}\; 6} \right)},} & {{{if}\mspace{14mu} M} = R} \\{{60 \times \left( {\frac{B - R}{C} + \; 2} \right)},} & {{{if}\mspace{14mu} M} = G} \\{{60 \times \left( {\frac{R - G}{C} + \; 4} \right)},} & {{{if}\mspace{14mu} M} = B}\end{matrix} \right.} & (7)\end{matrix}$

where the average pixel values {R, G, B}∈[0, 1] and chroma C is defined as:

C=M−m  (8)

where

M=max(R,G,B)

m=min(R,G,B)  (9)
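As an illustrative sketch only, equations (7) to (9) may be written in Python as shown below, with M taken as the maximum and m as the minimum of the three average pixel values. The function name is hypothetical, and returning 0 for a grey region (C=0), where the hue is undefined, is an assumption.

```python
def hue(r, g, b):
    """Hue in degrees for average pixel values r, g, b in [0, 1]."""
    big = max(r, g, b)      # M
    small = min(r, g, b)    # m
    chroma = big - small    # C = M - m
    if chroma == 0:
        # Assumption: hue is undefined for grey regions; 0 is returned by convention.
        return 0.0
    if big == r:
        return 60.0 * (((g - b) / chroma) % 6)
    if big == g:
        return 60.0 * ((b - r) / chroma + 2)
    return 60.0 * ((r - g) / chroma + 4)
```

The modulo operation in the first branch maps negative values, which occur for reddish colors with B>G, into the range from 300 to 360 degrees.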

Furthermore, similar skin regions corresponding to a particular body part are closely located in each instantaneous frame. The position of the geometric center P of each region is computed by averaging over the positions of the corner points belonging to the region:

$\begin{matrix}{P = {\frac{1}{N}{\sum\limits_{n = 0}^{N - 1}\; P_{n}}}} & (10)\end{matrix}$

To describe the periodicity of the pulse signal S, a spectral analysis may be performed. For this purpose, the Discrete Fourier Transform (DFT) may be used, which transforms the time domain signal by correlating the signal with cosines and sines of different frequencies as follows:

$\begin{matrix}{{{\hat{S}\lbrack k\rbrack} = {\sum\limits_{n = 0}^{N - 1}\; {{S\lbrack n\rbrack}^{{- {j2\pi}}\frac{kn}{N}}}}},{0 \leq k \leq {N - 1}}} & (11)\end{matrix}$

where N is the number of frames in the given interval. To retrieve an acceptable spectral resolution, the length of the interval is fixed at N=128 (6.4 sec) and the resulting spectrum is interpolated to obtain 512 FFT bins. FIGS. 4C, 5C, 6C, 7C show examples of S_(i) where i∈{A, B, C, D} in the spectral domain for the different regions in FIG. 3B. The spectrum of S_(A) consists of a dominant frequency whereas the distribution of frequencies for S_(B) is significantly wider. In practice, movement and imperfections of the camera introduce several noise components into the pulse signal. To reduce the influence of noise on the frequency spectrum, the DFT of the auto-correlation is computed since this will suppress the noise while retaining the periodicity of the pulse signal.

FIGS. 4C, 5C, 6C, 7C show examples of the auto-correlation of S_(i) where i∈{A, B, C, D} in the spectral domain for the different regions in FIG. 3B. The auto-correlation is computed using the following equation:

$\begin{matrix}{{{r\lbrack\tau\rbrack} = {\sum\limits_{n = 0}^{N - 1}\; {{S\lbrack n\rbrack}{S\left\lbrack {n - \tau} \right\rbrack}}}},\mspace{14mu} {{{- \frac{1}{2}}N} < \tau < {\frac{1}{2}N}}} & (12)\end{matrix}$

The frequency corresponding to the largest peak in the spectrum is denoted as F and is considered the HR frequency in regions that contain skin. This feature can be derived as follows:

$\begin{matrix}{F = {\hat{k} \cdot \frac{f_{s}}{N}}} & (13)\end{matrix}$

where f_(s) is the sample frequency and f_(s)/N is the resolution in the spectral domain. Index {circumflex over (k)} of the frequency bin corresponding to the largest peak is located as follows, where {circumflex over (R)} denotes the DFT of the auto-correlation r:

$\begin{matrix}{\hat{k} = {\arg {\max\limits_{k}\left\{ {\hat{R}\lbrack k\rbrack} \middle| {k \in \left\lbrack {0,{\left( {N - 1} \right)/2}} \right\rbrack} \right\}}}} & (14)\end{matrix}$

The phase angle θ of the pulse signal can be derived as follows:

$\begin{matrix}{\theta = {\arctan\left( \frac{{Im}\left( {\hat{S}\lbrack\hat{k}\rbrack} \right)}{{Re}\left( {\hat{S}\lbrack\hat{k}\rbrack} \right)} \right)}} & (15)\end{matrix}$

where θ∈[−π, π].
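For illustration only, the extraction of the frequency feature F and the phase feature θ according to equations (11) to (15) may be sketched in Python as follows. The function names are hypothetical. The sketch assumes a circular auto-correlation over N lags, so that the spectrum of the auto-correlation has the same N bins as the spectrum of the pulse signal, uses a two-argument arctangent so that θ covers the full range [−π, π], and omits the interpolation to 512 bins.

```python
import cmath
import math


def dft(signal):
    """Direct evaluation of the DFT of equation (11)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_autocorrelation(signal):
    # Assumption: indices wrap around, so that N lags are obtained.
    n = len(signal)
    return [
        sum(signal[t] * signal[(t - lag) % n] for t in range(n))
        for lag in range(n)
    ]


def frequency_and_phase(signal, fs):
    """Return (F, theta) for a pulse signal sampled at fs Hz."""
    n = len(signal)
    spectrum = dft(signal)
    acf_spectrum = [abs(v) for v in dft(circular_autocorrelation(signal))]
    # Equation (14): largest peak among the bins 0 .. (N - 1) / 2.
    k_hat = max(range((n - 1) // 2 + 1), key=lambda k: acf_spectrum[k])
    freq = k_hat * fs / n                                            # equation (13)
    theta = math.atan2(spectrum[k_hat].imag, spectrum[k_hat].real)   # equation (15)
    return freq, theta
```

With N=128 frames and a sample frequency of 20 Hz, a pulse signal with eight periods in the interval falls in bin 8, which corresponds to F=1.25 Hz, i.e. 75 beats per minute.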

The region can be represented as a point p=[F, θ, H, P_(g)] in a multidimensional feature space. FIG. 8A shows scatter plots of the feature space for the regions from the example video sequence where P_(g) is separated into the horizontal and vertical position. In these scatter plots it can be seen that there is a dense area of points corresponding to the skin regions.

To reduce the number of regions which are forwarded to the clustering, a prefiltering stage is preferably applied in which regions with certain characteristics are readily identified as background. For this purpose, several conditions may be used which a region must satisfy in order to be selected for the clustering.

As a first condition, the color traces may be used. In the above, it has been assumed that the region content is fixed due to the application of region tracking. However, regions containing object boundaries can still be subject to changes in content due to movement. An example of such regions is indicated with “C” in FIG. 3B. The amplitudes of the resulting normalized color traces are significantly larger than the color changes caused by blood pulsation as can be seen in FIG. 4A. On the other hand, the color changes in regions containing only background are due to noise and exhibit small amplitudes. An example of a background region is shown in FIG. 3B indicated with “B” and the resulting color changes can be seen in FIG. 4A. The peak amplitude for the normalized color traces C∈{R_(n), G_(n), B_(n)} is computed using:

δ_(C,max)=max{C[n]|n∈[0,N−1]}  (16)

where N is the interval length. The regions containing peak amplitudes that are outside the range [0.005, 0.15] are identified as background. Furthermore, the color traces of skin regions exhibit different amplitudes while remaining in phase, due to the different blood absorption rates of the individual color channels.

In order to assess the amplitude differences between the color traces, the normalized Sum of Absolute Differences (SAD) is computed using the following equation:

$\begin{matrix}{\delta_{SAD} = \frac{\sum_{n}\left| {{x\lbrack n\rbrack} - {y\lbrack n\rbrack}} \right|}{\sum_{n}\left| {x\lbrack n\rbrack} \right|}} & (17)\end{matrix}$

and regions for which δ_(SAD) is outside the range [0.05, 0.5] are identified as background. To find phase similarities of the color traces, the Normalized Correlation Coefficient (NCC) between the color traces is computed using the following equation:

$\begin{matrix}{\delta_{NCC} = \frac{\sum_{n}{{x\lbrack n\rbrack} \cdot {y\lbrack n\rbrack}}}{\sqrt{{\sum_{n}{x\lbrack n\rbrack}^{2}} \cdot {\sum_{n}{y\lbrack n\rbrack}^{2}}}}} & (18)\end{matrix}$

The regions for which δ_(NCC) is outside the range [0.5, 0.99] are identified as background.
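The color trace conditions of equations (16) to (18) may, purely by way of example, be sketched in Python as follows. The function names are hypothetical. The sketch assumes that the mean has been removed from the normalized color traces and that the SAD and NCC are evaluated for all three pairs of color traces; neither detail is prescribed above.

```python
import math


def peak_amplitude(trace):
    """Equation (16): peak amplitude of one color trace."""
    return max(trace)


def sad(x, y):
    """Equation (17): normalized sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(abs(a) for a in x)


def ncc(x, y):
    """Equation (18): normalized correlation coefficient."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def passes_color_conditions(rn, gn, bn):
    """True if a region is kept for clustering, False if it is background."""
    for trace in (rn, gn, bn):
        if not 0.005 <= peak_amplitude(trace) <= 0.15:
            return False
    # Assumption: every pair of color traces is compared.
    for x, y in ((rn, gn), (rn, bn), (gn, bn)):
        if not 0.05 <= sad(x, y) <= 0.5:
            return False
        if not 0.5 <= ncc(x, y) <= 0.99:
            return False
    return True
```

Note that the upper bound of 0.99 on the NCC rejects traces that are perfectly identical in shape, so a region is kept only if its color traces are similar but not exact copies of each other.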

As a second condition, the pulse signal may be used. To determine the presence of a dominant frequency, the ratio between the two largest peaks in the spectrum is calculated as:

$\begin{matrix}{{PR} = \frac{P_{\max \; 2}}{P_{\max \; 1}}} & (19)\end{matrix}$

for which 0≤PR≤1, where the upper bound corresponds to the case that the amplitudes of the peaks are identical and the lower bound corresponds to the case that exactly one frequency is present. The regions where PR>0.6 are identified as background. Furthermore, the peak amplitude of the pulse signal is computed as follows:

δ_(S,max)=max{S[n]|n∈[0,N−1]}  (20)

and the regions where δ_(S,max)<0.005 are identified as background. The parameters discussed here are chosen empirically. In FIG. 8B, the feature space of the example video sequence is shown where only the regions selected for clustering are shown. As can be seen, the dense area of points is still present whereas a large part of the background regions is removed.
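The pulse signal conditions of equations (19) and (20) may be illustrated with the following Python sketch. The function names are hypothetical, and taking the peaks as local maxima of the magnitude spectrum is an assumption, since the manner of locating the peaks is not detailed above.

```python
def peak_ratio(spectrum):
    """Equation (19): ratio of the second-largest to the largest spectral peak."""
    # Assumption: a peak is a local maximum of the magnitude spectrum.
    peaks = [
        spectrum[k]
        for k in range(1, len(spectrum) - 1)
        if spectrum[k - 1] < spectrum[k] >= spectrum[k + 1]
    ]
    if not peaks:
        return 1.0  # assumption: a spectrum without peaks has no dominant frequency
    peaks.sort(reverse=True)
    if len(peaks) == 1:
        return 0.0  # exactly one frequency present
    return peaks[1] / peaks[0]


def passes_pulse_conditions(spectrum, pulse, max_ratio=0.6, min_amplitude=0.005):
    """True if a region is kept for clustering, False if it is background."""
    return peak_ratio(spectrum) <= max_ratio and max(pulse) >= min_amplitude
```

The default thresholds correspond to the empirically chosen values PR>0.6 and δ_(S,max)<0.005 for identifying background.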

After eliminating the regions which can readily be identified as background, the remaining regions are grouped based on their features. In the scatter plots of the feature space for the example video sequence, an area of points can be clearly recognized for which the density is considerably higher than outside the area. These points in the feature space correspond to regions in the interval which contain skin. The points located outside the dense area correspond to background regions and are considered noise. A clustering method which relies on a density-based notion of clusters is the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) method as described in M. Ester, H. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise”, KDD, 1996. This method separates points into three different types using a density threshold with two parameters, ε specifying a range and minPts denoting a minimal number of points. A point is called a cluster core point if it has at least minPts points within the range ε. The points that do not meet this requirement yet have a core point within the range belong to that cluster and are called border points. The remaining points do not belong to any cluster and are called noise points. FIG. 9A shows an example of the partitioning of a distribution of points in a 2-dimensional space where ε is the radius of a circle. FIG. 9B shows the results of partitioning the points with ε=1 and minPts=6. The core points cp have a minimum of 6 neighboring points within a radius of 1 (circle). The border points bp have maximally 5 neighboring points, including a core point cp, within a radius of 1 (circle). The remaining points are noise points np.

This density-based approach to clustering can be extended to any higher dimensional space. Furthermore, the range can be shaped arbitrarily by using a separate distance parameter for every feature in the space. Whether two regions are located within the range of each other is determined as follows:

$\begin{matrix}{N_{i,j} = \left\{ \begin{matrix}{true} & {\left| {p_{i} - p_{j}} \right| \leq \varepsilon} \\{false} & {elsewhere}\end{matrix} \right.} & (21)\end{matrix}$

where the range ε=[3, 10, 10, 100] is empirically determined. The minimum number of points in a cluster is defined as minPts=6. FIG. 10 shows the clustering results for the example video sequence using these parameters. FIG. 10A shows scatter plots of the feature space of regions that pass the prefiltering step. The points indicated by circles belong to the skin regions. FIG. 10B shows clustering results for the example video sequence. It can be seen that most skin regions are determined correctly.
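A minimal Python sketch of density-based clustering with a separate range per feature, in the sense of equation (21), is given below for illustration. The function names are hypothetical. The sketch assumes that the comparison of equation (21) is applied per feature, and that a point counts itself among its neighbors, as in the DBSCAN publication cited above. It is not an optimized implementation.

```python
def in_range(p, q, eps):
    """Equation (21), applied per feature: |p - q| <= eps in every dimension."""
    return all(abs(a - b) <= e for a, b, e in zip(p, q, eps))


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster index (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Assumption: each point is included in its own neighborhood.
    neighbours = [
        [j for j in range(n) if in_range(points[i], points[j], eps)]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1          # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise within range of a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # j is a core point: keep expanding
    return labels
```

For example, `dbscan(points, eps=[1, 1], min_pts=4)` applied to four closely spaced points, one point within range of only two of them, and one distant point assigns the first five points to cluster 0 and labels the distant point as noise.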

It is possible that some regions containing skin are falsely identified as background regions as a consequence of the prefiltering. These regions do not meet the conditions necessary for clustering but may still contain strong pulse signals. In order to retrieve these particular skin regions, a cluster growing is performed. For this growing procedure, the average pulse signal of a cluster is compared to the pulse signal of surrounding regions using (18). When the pulse signal of a region corresponds significantly (δ_(NCC)>0.5) to the pulse signal of the cluster, the region is added to the cluster. Since the pulse signals of the regions in the clusters are likely to have different amplitudes, the pulse signals are normalized using:

$\begin{matrix}{{Sn}_{r} = \frac{S_{r}}{\sigma \left( S_{r} \right)}} & (22)\end{matrix}$

where σ(S_(r)) is the standard deviation of S_(r). To minimize the effects of outliers caused by noise, the pulse signals of the regions are combined using the alpha-trimmed mean. The average pulse signal of the cluster is obtained by eliminating the highest and the lowest values of each time period and averaging over the remaining values, i.e. for every sample i∈{1 . . . 128} and for all regions r∈{1 . . . M} within the cluster, the data Sn_(r,i) is ordered in such a way that for all i and for all r∈{1 . . . M−1} it holds that Sn_(r,i)≤Sn_(r+1,i). The average pulse signal of the cluster is then obtained as follows:

$\begin{matrix}{S_{c,i} = {\frac{1}{\left( {1 - \alpha} \right)M}{\sum\limits_{r}\; {Sn}_{r,i}}}} & (23)\end{matrix}$

where

$\left( {{\frac{\alpha}{2}M} + 1} \right) \leq r \leq {\left( {1 - \frac{\alpha}{2}} \right)M}$

and 0≤α≤1, where α is set to 0.5.
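The normalization of equation (22) and the alpha-trimmed mean of equation (23) may, by way of example, be sketched in Python as follows. The function names are hypothetical. Rounding the number of trimmed values down to an integer, and always keeping at least one value, are assumptions for the case that αM/2 is not an integer.

```python
import math


def normalize_pulse(pulse):
    """Equation (22): divide a pulse signal by its standard deviation."""
    mean = sum(pulse) / len(pulse)
    sigma = math.sqrt(sum((v - mean) ** 2 for v in pulse) / len(pulse))
    return [v / sigma for v in pulse]


def alpha_trimmed_mean(values, alpha=0.5):
    """Equation (23) for one sample: drop the extremes, average the rest."""
    ordered = sorted(values)
    m = len(ordered)
    # Assumption: floor of alpha * M / 2 values are dropped at each end.
    cut = min(int(alpha / 2 * m), (m - 1) // 2)
    kept = ordered[cut:m - cut]
    return sum(kept) / len(kept)


def cluster_pulse(pulses, alpha=0.5):
    """Average pulse signal of a cluster from the pulse signals of its regions."""
    normalized = [normalize_pulse(p) for p in pulses]
    # zip(*...) groups the values of all regions that belong to the same sample i.
    return [alpha_trimmed_mean(sample, alpha) for sample in zip(*normalized)]
```

With α=0.5 and M=4 regions, the lowest and the highest value of each sample are discarded and the two remaining values are averaged, so that a single outlying region does not affect the result.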

To further increase the robustness of the proposed method, i.e. ensuring the ROI is properly detected throughout the video sequences, the resulting pulse signal for the complete video sequence is compared to a reference signal. The pulse signal throughout the complete video sequence can be obtained by repeating the detection of the ROI and concatenating the resulting pulse signals. To prevent discontinuity at the edges of intervals, the ROI estimation is performed in an overlapping fashion and the resulting pulse signals are stitched together using Hann windowing on individual intervals:

$\begin{matrix}{S_{i} = {\sum\limits_{N}\; {{wh}_{N,i}S_{N,i}}}} & (24)\end{matrix}$

where S_(N,i) is the pulse signal in image i obtained from the Nth ROI estimator and wh_(N,i) is the Hann windowing function centered in interval N and zero outside the interval:

wh _(N,i)=0.5−0.5 cos(2πi/interval)  (25)

An illustration of the overlap-add procedure is shown in FIG. 11. In intervals where the ROI could not be determined, the interval pulse signal is replaced with a signal consisting of zeros to ensure continuity of the output pulse signal. Furthermore, in the case that multiple clusters are found, the cluster containing the most regions is considered as the ROI.
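The overlap-add procedure of equations (24) and (25) may be illustrated with the following Python sketch. The function names are hypothetical. An overlap of half an interval between consecutive intervals is assumed, since the amount of overlap is not specified above; with this choice the Hann windows of neighboring intervals sum to one. An interval without a detected ROI is represented by `None` and contributes zeros.

```python
import math


def hann(length):
    """Equation (25): Hann window over one interval."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / length) for i in range(length)]


def overlap_add(interval_pulses, length):
    """Equation (24): stitch per-interval pulse signals into one output signal."""
    hop = length // 2  # assumption: consecutive intervals overlap by half
    window = hann(length)
    output = [0.0] * (hop * (len(interval_pulses) - 1) + length)
    for index, pulse in enumerate(interval_pulses):
        if pulse is None:
            pulse = [0.0] * length  # ROI not found: zeros keep the output continuous
        start = index * hop
        for i in range(length):
            output[start + i] += window[i] * pulse[i]
    return output
```

For a constant input in all intervals, the output is constant wherever two intervals overlap and tapers to zero only at the very beginning and end of the sequence.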

FIG. 12 shows a flow chart of an embodiment of a method according to the present invention. In a first step S10 a sequence of image frames acquired over time is obtained. In a second step S12 an image frame of said sequence of image frames is segmented. In a third step S14 segments of the segmented image frame are tracked over time in said sequence of image frames. In a fourth step S16 the tracked segments are clustered to obtain clusters representing skin of a subject by use of one or more image features of said tracked segments.

The proposed device, system and method can be used for continuous unobtrusive monitoring of PPG related vital signs (e.g. heartbeat, SpO2, respiration), and can be used in a NICU, an operating room, or a general ward. The proposed device, system and method can also be used for personal health monitoring. Generally, the present invention can be used in all applications where skin needs to be detected in an image of a scene and particularly needs to be distinguished from non-skin, e.g. in surveillance applications such as access control or safety monitoring.

While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive; the invention is not limited to the disclosed embodiments. Other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims.

In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

A computer program may be stored/distributed on a suitable non-transitory medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems.

Any reference signs in the claims should not be construed as limiting the scope.

1. Device for skin detection comprising: an input unit for obtaining a sequence of image frames acquired over time, a segmentation unit for segmenting an image frame of said sequence of image frames, a tracking unit for tracking segments of the segmented image frame over time in said sequence of image frames, a clustering unit for clustering the tracked segments to obtain clusters representing skin of a subject by use of one or more image features of said tracked segments.
2. Device as claimed in claim 1, wherein said segmentation unit is configured to over-segment the image frame that is used for segmentation.
3. Device as claimed in claim 1, wherein said clustering unit is configured to use temporal color variations of the tracked segments as feature for clustering the tracked segments.
4. Device as claimed in claim 3, wherein said clustering unit is configured to perform a Fourier transformation of the spatially combined color value of at least part of the pixels in a tracked segment and to cluster two tracked segments into a single cluster if their amplitude peak frequency has substantially the same value and phase.
5. Device as claimed in claim 1, wherein said clustering unit is configured to compute the inner product between normalized time signals of different tracked segments and to cluster said tracked segments into a single cluster if said inner product exceeds a predetermined threshold.
6. Device as claimed in claim 1, wherein said clustering unit is configured to use the hue of pixels of the tracked segments as feature for clustering the tracked segments.
7. Device as claimed in claim 1, wherein said segmentation unit, tracking unit and clustering unit are configured to repeatedly perform segmentation, tracking and clustering, wherein the sequence of acquired image frames used for subsequent repetitions overlap in time.
8. Device as claimed in claim 1, wherein said segmentation unit is configured to perform a feature point detection, in particular a Harris corner detection.
9. Device as claimed in claim 1, wherein said segmentation unit is configured to perform a triangulation, in particular a Delaunay triangulation.
10. Device as claimed in claim 2, wherein said segmentation unit is configured to over-segment the image frame that is used for segmentation based on color, position and/or texture properties.
11. Device as claimed in claim 1, wherein said tracking unit is configured to track the segments of the segmented image frame using motion estimation.
12. Device as claimed in claim 1, further comprising a vital signs detector for detecting vital signs of a subject based on image information from detected skin areas within said sequence of image frames.
13. System for skin detection comprising: an imaging unit for acquiring a sequence of image frames over time and a device for skin detection as claimed in claim 1 based on the acquired sequence of image frames.
14. Method for skin detection comprising: obtaining a sequence of image frames acquired over time, segmenting an image frame of said sequence of image frames, tracking segments of the segmented image frame over time in said sequence of image frames, clustering the tracked segments to obtain clusters representing skin of a subject by use of one or more image features of said tracked segments.
15. Computer program comprising program code means for causing a computer to carry out the steps of the method as claimed in claim 14 when said computer program is carried out on the computer.